Abstract

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which 
for practical purposes is approximated by the length of the compressed version of the file involved, using 
a real-world compression program. This practical application is called 'normalized compression distance' 
and it is trivially computable. It is a parameter-free similarity measure based on compression, and is used 
in pattern recognition, data mining, phylogeny, clustering, and classification. The complexity properties of 
its theoretical precursor, the NID, have been open. We show that the NID is neither upper semicomputable 
nor lower semicomputable. 

Index Terms — Normalized information distance, Kolmogorov complexity, semicomputability. 

I. Introduction 

The classical notion of Kolmogorov complexity [8] is an objective measure for the information in 
a single object, and information distance measures the information between a pair of objects [2]. This 
last notion has spawned research in the theoretical direction, among others [3], [15], [16], [17], [12], 
[14]. Research in the practical direction has focused on the normalized information distance (NID), also 
called the similarity metric, which arises by normalizing the information distance in a proper manner. 
(The NID is defined by (II.1) below.) If we also approximate the Kolmogorov complexity through real-world
compressors [10], [4], [5], then we obtain the normalized compression distance (NCD). This
is a parameter-free, feature-free, and alignment-free similarity measure that has had great impact in 
applications. The NCD was preceded by a related nonoptimal distance [9]. In [7] another variant of the 
NCD has been tested on all major time-sequence databases used in all major data-mining conferences
against all other major methods used. The compression method turned out to be competitive in general 
and superior in heterogeneous data clustering and anomaly detection. There have been many applications 
in pattern recognition, phylogeny, clustering, and classification, ranging from hurricane forecasting and 
music to genomics and analysis of network traffic; see the many papers referencing [10], [4], [5] in
Google Scholar. The NCD is trivially computable. In [10] it is shown that its theoretical precursor, the
NID, is a metric up to negligible discrepancies in the metric (in)equalities and that it is always between
0 and 1. (For the subsequent computability notions see Section II.)
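As a concrete illustration of how an NCD value is obtained from a real-world compressor, the following minimal Python sketch uses the standard NCD formula $(C(xy) - \min\{C(x), C(y)\}) / \max\{C(x), C(y)\}$, where $C(\cdot)$ is the compressed length. The choice of zlib as the compressor and the helper names `compressed_len` and `ncd` are assumptions made for this example only.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"abracadabra" * 200
    b = bytes(range(256)) * 10
    # A string should be closer to itself than to an unrelated string.
    print(ncd(a, a), ncd(a, b))
```

Any other lossless compressor could be substituted for zlib; the resulting values depend on the compressor and are only approximations of the NID.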

The computability status of the NID has been open, see Remark VI.1 in [10] which asks whether the
NID is upper semicomputable, and (open) Exercise 8.4.4 (c) in the textbook [11] which asks whether 
the NID is semicomputable at all. We resolve this question by showing the following. 

Theorem 1.1: Let $x, y$ be strings and denote the NID between them by $e(x,y)$.

(i) The function $e$ is not lower semicomputable (Lemma 3.3).

(ii) The function $e$ is not upper semicomputable (Lemma 4.1).

Item (i) implies that there is no pair of lower semicomputable functions $g, \delta$ such that
$g(x,y) + \delta(x,y) = e(x,y)$. (If there were such a pair, then $e$ itself would be lower semicomputable.)
Similarly, Item (ii) implies that there is no pair of upper semicomputable functions $g, \delta$ such that
$g(x,y) + \delta(x,y) = e(x,y)$.
Therefore, the theorem implies 

Corollary 1.2: (i) The NID $e(x,y)$ cannot be approximated by a semicomputable function $g(x,y)$ to
any computable precision $\delta(x,y)$.

(ii) The NID $e(x,y)$ cannot be approximated by a computable function $g(x,y)$ to any semicomputable
precision $\delta(x,y)$.

How can this be reconciled with the above applicability of the NCD (an approximation of the NID 
through real-world compressors)? It can be speculated upon but not proven that natural data do not contain 
complex mathematical regularities such as $\pi = 3.1415\ldots$ or a universal Turing machine computation. The
regularities they do contain are of the sort detected by a good compressor. In this view, the Kolmogorov 
complexity and the length of the result of a good compressor are not that different for natural data. 

II. Preliminaries 

We write string to mean a finite binary string, and $\epsilon$ denotes the empty string. The length of a string
$x$ (the number of bits in it) is denoted by $|x|$. Thus, $|\epsilon| = 0$. Moreover, we identify strings with natural
numbers by associating each string with its index in the length-increasing lexicographic ordering

$(\epsilon, 0), (0, 1), (1, 2), (00, 3), (01, 4), (10, 5), (11, 6), \ldots$.
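The identification of strings with natural numbers can be sketched in Python as follows. The function names `string_to_index` and `index_to_string` are introduced here for illustration; the only fact used is that prepending a 1 to a binary string and reading the result as a binary numeral gives the index plus one.

```python
def string_to_index(s: str) -> int:
    """Index of binary string s in the length-increasing lexicographic order."""
    if any(ch not in "01" for ch in s):
        raise ValueError("expected a binary string")
    # "1" + s read in base 2 maps "", "0", "1", "00", ... to 1, 2, 3, 4, ...
    return int("1" + s, 2) - 1


def index_to_string(n: int) -> str:
    """Inverse map: the binary string whose index is n."""
    if n < 0:
        raise ValueError("expected a nonnegative integer")
    # bin(n + 1) looks like '0b1...'; drop the '0b' prefix and the leading 1.
    return bin(n + 1)[3:]


if __name__ == "__main__":
    print([index_to_string(i) for i in range(7)])
```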

Informally, the Kolmogorov complexity of a string is the length of the shortest string from which the 
original string can be losslessly reconstructed by an effective general-purpose computer such as a particular 
universal Turing machine U [8]. Hence it constitutes a lower bound on how far a lossless compression
program can compress. In this paper we require that the set of programs of U is prefix free (no program 
is a proper prefix of another program), that is, we deal with the prefix Kolmogorov complexity. (But for 
the results in this paper it does not matter whether we use the plain Kolmogorov complexity or the prefix 
Kolmogorov complexity.) We call $U$ the reference universal Turing machine. Formally, the conditional
prefix Kolmogorov complexity $K(x|y)$ is the length of the shortest input $z$ such that the reference universal
Turing machine $U$ on input $z$ with auxiliary information $y$ outputs $x$. The unconditional prefix Kolmogorov
complexity $K(x)$ is defined by $K(x|\epsilon)$. For an introduction to the definitions and notions of Kolmogorov
complexity (algorithmic information theory) see [11]. 
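The remark that Kolmogorov complexity bounds from below what a lossless compressor can achieve can also be read the other way round: every compressor yields an upper bound on the complexity, up to an additive constant that accounts for the decompressor. The sketch below records the running minimum of such bounds. It assumes the Python standard-library compressors zlib, bz2 and lzma as stand-ins, and the name `upper_bounds` is hypothetical. This is only an analogy to approximating $K$ from above; no finite list of compressors converges to $K$.

```python
import bz2
import lzma
import zlib

# Stand-in compressors; any lossless compressor could be appended to this tuple.
COMPRESSORS = (zlib.compress, bz2.compress, lzma.compress)


def upper_bounds(data: bytes) -> list[int]:
    """Running minimum, in bits, of the compressed lengths of data."""
    bounds = []
    best = None
    for compress in COMPRESSORS:
        bits = 8 * len(compress(data))
        best = bits if best is None else min(best, bits)
        bounds.append(best)  # non-increasing by construction
    return bounds


if __name__ == "__main__":
    print(upper_bounds(b"01" * 500))
```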

Let $\mathcal{N}$ and $\mathcal{R}$ denote the nonnegative integers and the real numbers, respectively. A function
$f : \mathcal{N} \rightarrow \mathcal{R}$ is upper semicomputable (or $\Pi_1^0$) if it is defined by a rational-valued
computable function $\phi(x,k)$, where $x$ is a string and $k$ is a nonnegative integer, such that
$\phi(x,k+1) \leq \phi(x,k)$ for every $k$ and $\lim_{k \rightarrow \infty} \phi(x,k) = f(x)$. This means that $f$ can be
computably approximated from above. A function $f$ is lower semicomputable (or $\Sigma_1^0$) if $-f$ is upper
semicomputable. A function is called semicomputable (or $\Pi_1^0 \cup \Sigma_1^0$) if it is either upper
semicomputable or lower semicomputable or both. A function $f$ is computable (or recursive) iff it is both
upper semicomputable and lower semicomputable (or $\Pi_1^0 \cap \Sigma_1^0$).
Use $\langle \cdot \rangle$ as a pairing function over $\mathcal{N}$ to associate a unique natural number
$\langle x, y \rangle$ with each pair $(x, y)$ of natural numbers. An example is $\langle x, y \rangle$ defined by
$y + (x + y + 1)(x + y)/2$. In this way we can extend
the above definitions to functions of two nonnegative integers, in particular to distance functions.
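The example pairing function $y + (x + y + 1)(x + y)/2$ and its inverse can be written out as a short Python sketch. The names `pair` and `unpair` are chosen here for illustration, and the inverse uses the usual triangular-number argument rather than anything specific to this paper.

```python
from math import isqrt


def pair(x: int, y: int) -> int:
    """Pairing <x, y> = y + (x + y + 1)(x + y) / 2 on nonnegative integers."""
    return y + (x + y + 1) * (x + y) // 2


def unpair(z: int) -> tuple[int, int]:
    """Inverse of pair: recover (x, y) from z = <x, y>."""
    w = (isqrt(8 * z + 1) - 1) // 2  # w = x + y, the index of the diagonal
    y = z - w * (w + 1) // 2  # offset of z within that diagonal
    return w - y, y


if __name__ == "__main__":
    print(pair(3, 4), unpair(pair(3, 4)))
```

With such a pairing, a function of two arguments can be treated as a function of the single argument `pair(x, y)`, which is how the definitions above carry over to distance functions.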

The information distance $D(x,y)$ between strings $x$ and $y$ is defined as

$$D(x,y) = \min_p \{|p| : U(p,x) = y \wedge U(p,y) = x\},$$

where $U$ is the reference universal Turing machine above. Like the Kolmogorov complexity $K$, the
distance function $D$ is upper semicomputable. Define

$$E(x,y) = \max\{K(x|y), K(y|x)\}.$$

In [2] it is shown that the function $E$ is upper semicomputable, $D(x,y) = E(x,y) + O(\log E(x,y))$, the
function $E$ is a metric (more precisely, that it satisfies the metric (in)equalities up to a constant), and
that $E$ is minimal (up to a constant) among all upper semicomputable distance functions $D'$ satisfying
the mild normalization conditions $\sum_{y : y \neq x} 2^{-D'(x,y)} \leq 1$ and $\sum_{x : x \neq y} 2^{-D'(x,y)} \leq 1$.
(Here and elsewhere in this paper "log" denotes the binary logarithm.) It should be mentioned that the
minimality property was relaxed from the $D'$ functions being metrics [2] to symmetric distances [10] to
the present form [11] without serious proof changes. The normalized information distance (NID) $e$ is defined by

$$e(x,y) = \frac{E(x,y)}{\max\{K(x), K(y)\}}. \qquad \text{(II.1)}$$

It is straightforward that $0 \leq e(x,y) \leq 1$ up to some minor discrepancies for all $x, y \in \{0,1\}^*$. Since $e$
is the ratio between two upper semicomputable functions, that is, between two $\Pi_1^0$ functions, it is a
$\Delta_2^0$ function. That is, $e$ is computable relative to the halting problem $\emptyset'$. One would not expect
any better bound in the arithmetic hierarchy. However, we can say this: Call a function $f(x,y)$ computable in the
limit if there exists a rational-valued computable function $g(x,y,t)$ such that
$\lim_{t \rightarrow \infty} g(x,y,t) = f(x,y)$.
This is precisely the class of functions that are Turing-reducible to the halting set, and the NID is in this
class; see Exercise 8.4.4 (b) in [11] (a result due to [6]).
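The range claim for the NID can be made explicit with a short calculation. It assumes the standard definition $e(x,y) = E(x,y)/\max\{K(x), K(y)\}$ and the standard fact that conditioning cannot increase complexity by more than an additive constant; the name $c_0$ for that constant is introduced here for illustration only.

```latex
\begin{align*}
K(x|y) &\leq K(x) + c_0, \qquad K(y|x) \leq K(y) + c_0, \\
E(x,y) &= \max\{K(x|y), K(y|x)\} \leq \max\{K(x), K(y)\} + c_0, \\
0 \leq e(x,y) &= \frac{E(x,y)}{\max\{K(x), K(y)\}} \leq 1 + \frac{c_0}{\max\{K(x), K(y)\}}.
\end{align*}
```

The upper bound 1 is thus exceeded by at most a term that vanishes as the complexities of $x$ and $y$ grow, which is the kind of minor discrepancy meant in the text.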

In the sequel we use time-bounded Kolmogorov complexity. Let $x$ be a string of length $n$ and $t(n)$ a
computable time bound. Then $K^t$ denotes the time-bounded version of $K$ defined by

$$K^t(x|y) = \min_p \{|p| : U'(p,y) = x \text{ in at most } t(n) \text{ steps}\}.$$

Here we use the two work-tape reference universal Turing machine $U'$ suitable for time-bounded
Kolmogorov complexity [11]. The computation of $U'$ is measured in terms of the output rather than
the input, which is more natural in the context of Kolmogorov complexity.

III. The NID is not lower semicomputable

Define the time-bounded version $E^t$ of $E$ by

$$E^t(x,y) = \max\{K^t(x|y), K^t(y|x)\}. \qquad \text{(III.1)}$$



Lemma 3.1: For every length $n$ and computable time bound $t$ there are strings $u$ and $v$ of length $n$
such that

• $K(v) \geq n - c_1$,

• $K(v|u) \geq n - c_2$,

• $K(u|n) \leq c_2$,

• $K^t(u|v) \geq n - c_1 \log n - c_2$,

where $c_1$ is a nonnegative constant independent of $t, n$, and $c_2$ is a nonnegative constant depending on $t$
but not on $n$.

Proof: Fix an integer $n$. There is a $v$ of length $n$ such that $K(v|n) \geq n$ by simple counting (there
are $2^n$ strings of length $n$ and at most $2^n - 1$ programs of length less than $n$). If we have a program
for $v$ then we can turn it into a program for $v$ ignoring conditional information by adding a constant
number of bits. Hence, $K(v) + c \geq K(v|n)$ for some nonnegative constant $c$. Therefore, for large enough
nonnegative constant $c_1$ we have

$$K(v) \geq n - c_1.$$

Let $t$ be a computable time bound and let the computable time bound $t'$ be large enough with respect to
$t$ so that the arguments below hold. Use the reference universal Turing machine $U'$ with input $n$ to run
all programs of length less than $n$ for $t'(n)$ steps. Take the least string $u$ of length $n$ not occurring as an
output among the halting programs. Since there are at most $2^n - 1$ programs as above, and $2^n$ strings of
length $n$, there is always such a string $u$. By construction $K^{t'}(u|n) \geq n$ and for a large enough constant
$c_2$ also

$$K(u|n) \leq c_2,$$

where $c_2$ depends on $t'$ (hence $t$) but not on $n, u$. Since $u$ in the conditional only supplies $c_2$ bits apart
from its length $n$ we have

$$K(v|u) \geq K(v|n) - K(u|n) \geq n - c_2.$$

This implies also that $K^{t'}(v|u) \geq n - c_2$. Hence,

$$2n - c_2 \leq K^{t'}(u|n) + K^{t'}(v|u).$$

Now we use the time-bounded symmetry of algorithmic information [13] (see also [11], Exercise 7.1.12)
where $t$ is given and $t'$ is chosen in the standard proof of the symmetry of algorithmic information [11],
Section 2.8.2 (the original is due to L.A. Levin and A.N. Kolmogorov in [18]), so that the statements
below hold. (Recall also that for large enough $f$, $K^f(v|u,n) = K^f(v|u)$ and $K^f(u|v,n) = K^f(u|v)$
since in the original formulas $n$ is present in each term.) Then,

$$K^{t'}(u|n) + K^{t'}(v|u) - c_1 \log n \leq K^{t'}(v,u|n),$$

with the constant $c_1$ large enough and independent of $t, t', n, u, v$. For an appropriate choice of $t'$ with
respect to $t$ it is easy to see (the simple side of the time-bounded symmetry of algorithmic information)
that

$$K^{t'}(v,u|n) \leq K^t(v|n) + K^t(u|v).$$

Since $K^t(v|n) \geq K(v|n) \geq n$ we obtain $K^t(u|v) \geq n - c_1 \log n - c_2$. ■

A similar but tighter result can be obtained from [1], Lemma 7.7.
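The diagonal step in the proof of Lemma 3.1 (run every program of length less than $n$ for a bounded number of steps, then take the least string of length $n$ that was not produced) can be sketched in Python. The sketch is not the construction with the reference machine $U'$: the `toy_machine` below is a hypothetical, non-universal stand-in, and the names `Machine` and `least_unproduced` are introduced here for illustration.

```python
from itertools import product
from typing import Callable, Optional

# A "machine" maps (program, n, step budget) to an output string,
# or to None if the program does not halt within the budget.
Machine = Callable[[str, int, int], Optional[str]]


def least_unproduced(n: int, run: Machine, steps: int) -> str:
    """Least n-bit string not output by any program shorter than n bits."""
    outputs = set()
    for length in range(n):  # program lengths 0, 1, ..., n - 1
        for bits in product("01", repeat=length):
            out = run("".join(bits), n, steps)
            if out is not None:
                outputs.add(out)
    # Only 2**n - 1 programs were run, so some n-bit string is missing.
    for bits in product("01", repeat=n):  # lexicographic order
        candidate = "".join(bits)
        if candidate not in outputs:
            return candidate
    raise AssertionError("unreachable by the counting argument")


def toy_machine(program: str, n: int, steps: int) -> Optional[str]:
    """Hypothetical toy machine: writes the program's value as n bits in n steps."""
    if steps < n:
        return None  # not enough time to write n output bits
    value = int(program, 2) if program else 0
    return format(value, "b").zfill(n)


if __name__ == "__main__":
    print(least_unproduced(4, toy_machine, 100))
```

The search itself is a short program that only needs $n$ as input, which mirrors the bound $K(u|n) \leq c_2$, while no program shorter than $n$ produces the returned string within the step budget.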

Lemma 3.2: For every length $n$ and computable time bound $t$ (provided $t(n) \geq cn$ for a large enough
constant $c$), there exist strings $v$ and $w$ of length $n$ such that

• $K(v) \geq n - c_1$,

• $E(v,w) \leq c_3$,

• $E^t(v,w) \geq n - c_1 \log n - c_3$,

where the nonnegative constant $c_3$ depends on $t$ but not on $n$ and the nonnegative constant $c_1$ is
independent of $t, n$.

Proof: Let strings $u, v$ and constants $c_1, c_2$ be as in Lemma 3.1 using $2t$ instead of $t$, and let the constants
$c', c'', c_3$ be large enough for the proof below. By Lemma 3.1 we have $K^{2t}(u|v) \geq n - c_1 \log n - c_2$
with $c_2$ appropriate for the time bound $2t$. Define $w$ by $w = v \oplus u$ where $\oplus$ denotes the bitwise XOR.
Then,

$$E(v,w) \leq K(u|n) + c' \leq c_3,$$

where the nonnegative constant $c_3$ depends on $2t$ (since $u$ does) but not on $n$ and the constant $c'$ is
independent of $t, n$. We also have $u = v \oplus w$ so that (with the time bound $t(n) \geq cn$ for $c$ a large enough
constant independent of $t, n$)

$$\begin{aligned}
n - c_1 \log n - c_2 &\leq K^{2t}(u|v) \\
&\leq K^t(w|v) + c' \\
&\leq \max\{K^t(v|w), K^t(w|v)\} + c'' \\
&= E^t(v,w) + c'',
\end{aligned}$$

where the nonnegative constants $c', c''$ are independent of $t, n$. ■
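The role of the bitwise XOR in the proof of Lemma 3.2 is that each of the three strings $u$, $v$, $w$ can be recovered from the other two in a single linear pass, which is why a time bound of the form $t(n) \geq cn$ suffices. A minimal Python sketch, with the helper name `xor_bits` and the sample strings chosen here purely for illustration:

```python
def xor_bits(a: str, b: str) -> str:
    """Bitwise XOR of two binary strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return "".join("1" if p != q else "0" for p, q in zip(a, b))


v = "1011001110"
u = "0000011111"  # plays the role of the simple string u
w = xor_bits(v, u)

# Each string is recovered from the other two in one linear pass.
assert xor_bits(v, w) == u
assert xor_bits(w, u) == v
```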

Lemma 3.3: The function $e$ is not lower semicomputable.

Proof: Assume by way of contradiction that the lemma is false. Let $e_i$ be a lower semicomputable
function approximation of $e$ such that $e_{i+1}(x,y) \geq e_i(x,y)$ for all $i$ and
$\lim_{i \rightarrow \infty} e_i(x,y) = e(x,y)$. Let
$E_i$ be an upper semicomputable function approximating $E$ such that $E_{i+1}(x,y) \leq E_i(x,y)$ for all $i$ and
$\lim_{i \rightarrow \infty} E_i(x,y) = E(x,y)$. Finally, for strings $x, y$ of length $n$ let $i_{x,y}$ denote the least $i$
such that

$$e_i(x,y) \geq \frac{E_i(x,y)}{n + 2 \log n + c}, \qquad \text{(III.2)}$$

where $c$ is a large enough constant (independent of $n, i$) such that $K(z) \leq n + 2 \log n + c$ for every
string $z$ of length $n$ (this follows from the upper bound on $K$, see [11]). Since the function $E$ is upper
semicomputable and the function $e$ is lower semicomputable by the contradictory assumption such an
$i_{x,y}$ exists. Define the function $s$ by $s(n) = \max_{|x| = |y| = n} \{i_{x,y}\}$.

Claim 3.4: The function $s(n)$ is total computable and $E^{s(n)}(v,w) \geq n - c_1 \log n - c_3$ for some strings
$v, w$ of length $n$ and constants $c_1, c_3$ in Lemma 3.2.

Proof: By the contradictory assumption $e$ is lower semicomputable, and $E$ is upper semicomputable
since $K(\cdot|\cdot)$ is. Recall also that $e(x,y) \geq E(x,y)/(n + 2 \log n + c)$ for every pair $x, y$ of strings of length
$n$. Hence for every such pair $(x,y)$ we can compute $i_{x,y} < \infty$. Since $s(n)$ is the maximum of $2^{2n}$
computable integers, $s(n)$ is computable as well and total. Then, the claim follows from Lemma 3.2.
(If $s(n)$ happens to be too small to apply Lemma 3.2 we increase it total computably until it is large
enough.) ■

Remark 3.5: The string $v$ of length $n$ as defined in the proof of Lemma 3.1 satisfies $K(v|n) \geq n$.
Hence $v$ is incomputable [11]. Similarly this holds for $w = v \oplus u$ (defined in Lemma 3.2). But above
we look for a function $s(n)$ such that all pairs $x, y$ of strings of length $n$ (including the incomputable
strings $v, w$) satisfy (III.2) with $s(n)$ replacing $i_{x,y}$. Since the computable function $s(n)$ does not depend
on the particular strings $x, y$ but only on their length $n$, we can use it as the computable time bound $t$
in Lemmas 3.1 and 3.2 to define strings $u, v, w$ of length $n$.

For given strings $x, y$ of length $n$, the value $E_{i_{x,y}}(x,y)$ is not necessarily equal to $E^{s(n)}(x,y)$. Since $s(n)$
majorises the $i_{x,y}$'s and $E$ is upper semicomputable, we have $E^{s(n)}(x,y) \leq E_{i_{x,y}}(x,y)$, for all pairs $(x,y)$
of strings $x, y$ of length $n$. ◊

Since $K(v) \geq n - c_1$ we have $E(v,w) \geq e(v,w)(n - c_1)$. By the contradictory assumption that $e$ is
lower semicomputable we have $e(v,w) \geq e_{s(n)}(v,w)$. By (III.2) and the definition of $s(n)$ we have

$$e_{s(n)}(v,w) \geq \frac{E^{s(n)}(v,w)}{n + 2 \log n + c}.$$

Hence,

$$E(v,w) \geq \frac{E^{s(n)}(v,w)(n - c_1)}{n + 2 \log n + c}.$$

But $E(v,w) \leq c_3$ by Lemma 3.2 and $E^{s(n)}(v,w) \geq n - c_1 \log n - c_3$ by Claim 3.4, which yields the
required contradiction for large enough $n$. ■
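To see why the final step fails for large $n$, the bounds used at the end of the proof of Lemma 3.3 can be combined into a single chain. This is only a restatement of those inequalities, under the reading that the numerator is the time-bounded quantity $E^{s(n)}(v,w)$ of Claim 3.4.

```latex
\begin{align*}
c_3 \;\geq\; E(v,w)
  \;\geq\; \frac{E^{s(n)}(v,w)\,(n - c_1)}{n + 2\log n + c}
  \;\geq\; \frac{(n - c_1 \log n - c_3)(n - c_1)}{n + 2\log n + c}.
\end{align*}
```

The right-hand side grows linearly in $n$, whereas $c_3$ does not depend on $n$, so the chain cannot hold for all sufficiently large $n$.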

IV. The NID is not upper semicomputable 

Lemma 4.1: The function $e$ is not upper semicomputable.

Proof: It is easy to show that $e(x,x)$ (and hence $e(x,y)$ in general) is not upper semicomputable.
For simplicity we use $e(x,x) = 1/K(x)$. Assume that the function $1/K(x)$ is upper semicomputable.
Then, $K(x)$ is lower semicomputable. Since $K(x)$ is also upper semicomputable, it is computable. But
this violates the known fact [11] that $K(x)$ is incomputable. ■
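The step from an upper approximation of $1/K(x)$ to a lower approximation of $K(x)$ in the proof of Lemma 4.1 can be spelled out as follows. Here $\phi$ is a rational-valued computable approximation from above as in the definition of upper semicomputability in Section II, and the calculation assumes $K(x) \geq 1$ so that all quantities are positive.

```latex
\begin{align*}
\phi(x,k) \geq \phi(x,k+1) \geq \frac{1}{K(x)} > 0
  \quad &\Longrightarrow \quad
  \frac{1}{\phi(x,k)} \leq \frac{1}{\phi(x,k+1)} \leq K(x), \\
\lim_{k \to \infty} \phi(x,k) = \frac{1}{K(x)}
  \quad &\Longrightarrow \quad
  \lim_{k \to \infty} \frac{1}{\phi(x,k)} = K(x).
\end{align*}
```

So $1/\phi(x,k)$ is a rational-valued computable approximation of $K(x)$ from below, which is what lower semicomputability of $K$ requires.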

V. Open Problem 

A subset of $\mathcal{N}$ is called n-computably enumerable (n-c.e.) if it is a Boolean combination of n
computably enumerable sets. Thus, the 1-c.e. sets are the computably enumerable sets, the 2-c.e. sets (also 
called d.c.e.) the differences of two c.e. sets, and so on. The n-c.e. sets are referred to as the difference 
hierarchy over the c.e. sets. This is an effective analog of a classical hierarchy from descriptive set theory. 
Note that a set is n-c.e. if it has a computable approximation that changes at most n times. 

We can extend the notion of $n$-c.e. set to a notion that measures the number of fluctuations of a function
as follows: For every $n \geq 1$, call $f : \mathcal{N} \rightarrow \mathcal{R}$ $n$-approximable if there is a
rational-valued computable approximation $\phi$ such that $\lim_{k \rightarrow \infty} \phi(x,k) = f(x)$ and such that
for every $x$, the number of $k$'s such that $\phi(x,k+1) - \phi(x,k) < 0$ is bounded by $n - 1$. That is, $n - 1$ is a
bound on the number of fluctuations of the approximation. Note that the 1-approximable functions are precisely
the lower semicomputable ($\Sigma_1^0$) ones (zero fluctuations). Also note that a set $A \subseteq \mathcal{N}$ is
$n$-c.e. if and only if the characteristic function of $A$ is $n$-approximable.
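The notion of fluctuation can be made concrete by counting the decreases in a finite prefix of an approximation sequence. In the Python sketch below the names `count_decreases` and `is_consistent_with` are introduced for illustration. Note that a finite prefix can only refute a bound of $n - 1$ decreases, never establish $n$-approximability.

```python
from fractions import Fraction
from typing import Sequence


def count_decreases(values: Sequence[Fraction]) -> int:
    """Number of indices k with values[k + 1] - values[k] < 0."""
    return sum(1 for a, b in zip(values, values[1:]) if b - a < 0)


def is_consistent_with(values: Sequence[Fraction], n: int) -> bool:
    """True if this finite prefix has at most n - 1 decreases."""
    return count_decreases(values) <= n - 1


if __name__ == "__main__":
    # A nondecreasing prefix, as for a lower semicomputable function.
    mono = [Fraction(k, k + 1) for k in range(10)]
    # A prefix with two decreases (3 -> 2 and 4 -> 1).
    zig = [Fraction(1), Fraction(3), Fraction(2), Fraction(4), Fraction(1)]
    print(count_decreases(mono), count_decreases(zig))
```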

Conjecture: For every $n \geq 1$, the normalized information distance $e$ is not $n$-approximable.